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I use a parametric, bijective transforniation to generate heavy tail versions Y of an arbitrary random 
variable (RV) X ^ Fx, by similar concepts as in Goerg (2011) for skewed RVs. The tail behaviour of the 
so-called heavy tail Lambert W x Fx RV Y depends on a tail parameter S > 0; for S = 0, Y = X, for 
S > Y has heavier tails than X. For X being Gaussian, this meta- family of heavy-tailed distributions 
reduces to Tukey's h distribution. Lambert's W function provides an explicit inverse transformation, which 
can be used to remove skewness and heavy tails from data and then apply standard methods and models 
to this so obtained "nice" (Gaussianised) data. The optimal inverse transformation can be estimated by 
maximum likelihood. This transformation based approach to heavy tails also yields analytical, concise and 
simple expressions for the cumulative distribution (cdf) Gyiy) and probability density function (pdf) gviy)- 
As a special case, I present explicit expressions for Tukey's h pdf and cdf - to the authors knowledge for the 
first time in the literature. Applications to a simulated Cauchy sample, S&P 500 log-returns, as well as solar 
flares data demonstrate the usefulness of the introduced methodology. 

The R package LambertW contains a wide range of methods presented here and is publicly available at 
CRAN. 

Keywords: Gaussianising, heavy tails, power law, Tukey's h distribution, Lambert W, kurtosis, transfor- 
mation of random variables; latent variables. 
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1 Introduction 



Both theory and statistical practice are tightly linked to Gaussianity. In theory, many statistical models 
require the data, noise, or parameters to have a (multivariate) Gaussian distribution: i) t and F-tests in 
regression rely on the assumption of (approximately) Gaussian errors; ii) pattern recognition or de-noising 
methods for images often model the additive noise as a Gaussian random field (Achim, Tsakalides, and 
Bezerianos, 2003); iii) non-standard distributions can be modelled by a mixture of Gaussians (Blum, Zhang, 
Sadler, and Kozick, 1999); iv) the Kalman filter is known to be optimal - in its basic form - for Gaussian 
errors (Friedland, 2005); v) many time series models are based on the building block of a Gaussian white 
noise sequence et (Brockwell and Davis, 1998; Engle, 1982; Granger and Joyeux, 2001). In all the above cases, 
the model M^f, parameter estimates 6 and their standard errors, and many other statistical properties, are 
then studied - all based on the ideal(istic) assumption of Gaussianity. 

In practice, however, data/noise is rarely Gaussian but often exhibits asymmetry and/or heavy tails; for 
example wind speed data (Field, 2004), human dynamics (Vazquez, Oliveira, Dezso, Goh, Kondor, and 
Barabasi, 2006), or Internet traffic data (Gidlund and Debernardi, 2009). Particularly notable examples are 
financial data (Cont, 2001; Kim and White, 2003) and speech signals (Aysal and Barner, 2006) which almost 
exclusively exhibit heavy tails. Thus a model A4j\f developed for the Gaussian case does not necessarily 
provide accurate inference anymore. 

One way to overcome this shortcoming is to replace the Gaussian distribution in A4j\f with a heavy-tailed 
distribution G and study properties of the new model Mg- i) regression with Cauchy errors (Smith, 1973); 
ii) image denoising for a-stable noise (Achim et al., 2003); iii) non-Gaussian mixture models to approximate 
the distribution of a-stable processes (Swami, 2000); iv) Kalman filtering for Cauchy (Idan and Speyer, 
2010; Nezafat and Amindavar, 2001) or Levy noise (Ahn and Feldman, 1999); v) forecasting long memory 
processes with heavy tail or general non-Gaussian innovations (How, 2000; Palma and Zevallos, 2011), or 
ARMA Modelling of electricity loads with hyperbolic noise (Nowicka-Zagrajek and Weron, 2002). 

While this fundamental approach to solve each problem is attractive from a theoretical perspective, it can 
become unsatisfactory from a practical viewpoint. Many successful models in the literature assume at some 
point Gaussianity, their theory for the Normal case is very well understood, many algorithms are implemented 
for the simple Gaussian case and not for the very particular setting of a Cauchy/Levy/a-stable model, thus 
developing models based on a completely unrelated distribution G is like throwing out the (Gaussian) baby 
with the bathwater. 

It would be very useful if we could transform a Gaussian RV X to a (skewed) heavy-tailed RV Y and 
vice versa, and thus rely on the knowledge - and software - for the well-understood Gaussian case, while 
still Modelling the skewness and excess kurtosis in the data. Optimally such a transformation should: a) be 
bijective, so we can go back and forth between the skewed/heavy-tailed distribution Griy) and the Gaussian 
Fx{x)\ b) include Normality as a special case, so we can test for skewness/heavy tails; and c) be parametric 
(r), Y = Ht{X)^ to estimate the optimal transformation efficiently from the sample y — {yi, . . . , i/n)- 

Liu, Lafferty, and Wasserman (2009) introduce a semi-parametric approach, where Y has a non-paranormal 
distribution if f{Y) ^ J^{n, cr^) and /(•) is an increasing smooth function. It is semi-parametric in the sense 
that /(•) is estimated non-parametrically. This leads to a greater flexibility in distribution shapes for Y than 
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Figure 1: Schematic view of the heavy tail Lambert W x Fx framework, (left) Latent (Gaussian) input 
X ^ Fx- Hr{X) from (5) transforms (solid arrows) X to Y ^ Lambert W x Fx and introduces heavy 
tails, (right) Observed heavy tail world Y and y: (1) use WV(-) to back-transform y to latent "Normally" 
tailed data x^., (2) use model M^f of your choice (regression, time series models, hypothesis test, quantile 
estimation, graph estimation, etc.) to make inference on x,-, and (3) convert results back to the original 
"heavy-tailed world" of y. 



94 any fixed parametric transformation, but it suffers from two drawbacks: i) non-parametric methods have 

95 slower convergence rates compared to parametric techniques, and ii) one identifiability condition for /(•) is 

96 that E/(y) = EY and Yf{Y) = YY. While the first point is the inherent cost for non-parametric generality, 

97 the second requires Y to have finite mean and variance, which is especially limiting for heavy-tailed data 

98 where this condition is often not met. Thus here we will study parametric transformations which are not as 

99 general as a non-parametric variant, but do not rely on such a restrictive identifiability condition and also 

100 work well for small sample sizes. 

101 

102 Figure 1 illustrates this pragmatic approach to analyse heavy-tailed data: applied researchers can make 

103 their observed data y as most Gaussian as possible (x^) before making inference based on their favorite 

104 Gaussian model M^f■ This avoids the development of - or the data analysts waiting for - a whole new 

105 theory of A^g £^nd new software implementations based on a particular heavy-tailed distribution G, while 

106 still improving statistical inference on skewed/heavy-tailed data y. For example, consider the simulated 

107 y = {yi, . . . , usoo) from a standard Cauchy distribution C(0, 1) in Fig. 2a: by Modelling the heavy tails by 
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(a) Random sample y ~ C(0, 1) from a (b) Gaussianised Cauchy ^9mle ""'^^ (c) Cumulative sample average, r'") es- 
standard Cauchy (A^ = 500) f = (0.030, 1.054, 0.861) timated for each fixed n = 5, . . . , 500, be- 

fore Gaussianising the data (j/i, . . . ,y„). 

Figure 2: Gaussianising a Cauchy sample. 

108 a transformation based method - rather than a particular density shape - it is possible to Gaussianise even 

109 this Cauchy sample (Fig. 2b). This "nice" data can then be used for subsequent statistical analysis, e.g. 
no estimating the location by the sample average (Fig. 2c) which is a bad choice for the Cauchy sample, but a 
111 good choice for the Gaussianised version. For a more detailed analysis of this data see Section 6.1. 

112 

113 The main contributions of this work are three-fold: a) following the recently introduced skewed Lambert 

114 W X Fx distributions (Goerg, 2011) I introduce a meta-family of heavy tail Lambert W x Fx distributions 

115 which include Tukey's h distribution (Tukey, 1977) as a special case; b) I derive the analytic inverse and 

116 thus get a bijective transformation to "Gaussianise" heavy-tailed data (Section 2); c) I also provide simple 

117 expressions for the cumulative distribution function (cdf) Gyiy) and probability density function (pdf) gviy) 
us - including Tukey's h and hh distribution-, which can be easily implemented in standard statistics package 

119 (Section 2.4). To the author's knowledge analytic expressions for Tukey's h cdf and pdf are presented here 

120 (Section 3) for the first time in the literature. Section 4 introduces a methods of moments estimator for the 

121 optimal inverse transformation and studies properties of the maximum likelihood estimator (MLE). Section 

122 5 shows their finite sample properties. 

123 As has been shown in many case studies, Tukey's h distribution (heavy tail Lambert W x Gaussian) is 

124 useful to model data with unimodal, heavy-tailed densities. Section 6 not only confirms this finding for 

125 S&P 500 log-returns, but also demonstrates the benefits of removing heavy tails from data for exploratory 

126 data analysis: in particular, Gaussianising 7-ray intensity data reveals a truly bimodal density, which even 

127 non-parametric estimators fail to detect if heavy tails are not removed. 

128 

129 Computations, figures, and simulations were done with the open-source statistics package R (R Dcvel- 

130 opmcnt Core Team, 2010). Functions used in the analysis and many other methods are available in the R 

131 package LambertW, which provides necessary tools to perform Lambert W inference in practice. 
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132 2 Modelling heavy tails using transformations 

133 As discussed in the previous section, many statistical methods have to be adapted in presence of heavy tails 

134 in the data. Random variables exhibit heavy tails if more mass than for a Gaussian RV lies at the outer end 

135 of the density support. One common definition^ for a RV Z to have a heavy tail with tail index a is that its 

136 cdf satisfies 1—-Fz{z) ^ L{z)z^°' , where L{z) is a slowly varying function at infinity, i.e. Wtclz^oo = 1 for 

137 all t > (Back and Pipiras, 2010). The heavy tail index a is an important characteristic of Z; for example, 

138 only moments up to order a exist. 



2.1 Tukey's h distribution 

A transformation based approach to heavy tails as discussed in the Introduction is the basis of Tukey's h 
RVs (Tukcy, 1977) 

Z = f/exp ( ) , /i>0, (1) 



where U is standard Normal and h is the heavy tail parameter. Tukey's h RVs are parametric heavy tail 
versions of a Gaussian RV with tail parameter a ~ 1/h (Morgenthaler and Tukey, 2000) , which reduce to the 
Gaussian for h ~ 0. Morgenthaler and Tukey (2000) extend the h distribution to the skewed, heavy-tailed 
family of hh RVs 

fc/exp(|t/2), ift/<0, 
|[/exp(|^C/2) , ifJ7>0, 

where again U ^ J\f{Q,l). Here Sg > and Sr > shape the left and right tail of Z, respectively; thus 
transformation (2) can model skewed and heavy-tailed data - see Fig. 3a. 

Their use in statistical practice is limited however, as the inverse of (1) or (2) have not been available in 
explicit, closed form. Consequently, no closed-form expressions for the cdf or pdf are available. Although 
Morgenthaler and Tukey (2000) express the pdf of (1) as {h = 6) 

9z{z) = ,w )rT-lr ^^ ' (3) 



142 they fall short of explicitly specifying H^^{z). So far this inverse has been considered analytically intractable 

143 Field (2004), or only possible to approximate numerically (Fischer, 2010; Todd C. Headrick and Sheng, 2008). 

144 Thus parameter estimation and inference relies on matching empirical and theoretical quantiles (Field, 2004; 

145 Morgenthaler and Tukey, 2000), or by the method of moments (Todd C. Headrick and Sheng, 2008). Only 

146 recently Todd C. Headrick and Sheng (2008) provided a numerical approximation to the cdf and pdf. Numer- 

147 ical approximations slow down the estimation of any statistical model - let it be in a frequentist or Bayesian 

148 setting. Hence, a closed form, analytically tractable pdf that can be computed efficiently is essential for a 

149 wide-spread use of Tukey's h (& variants) distribution. 

150 

151 Here I provide this inverse transformation and thus also an easily computable cdf and pdf, which can be 

152 implemented in standard statistics packages. For ease of notation and concision main results are shown in 

^ There are various similar, but not exactly equivalent definitions of heavy-tailed RVs / distributions; for the context of this 
work these differences are not essential. 
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153 detail for the symmetric case 5^ = 6,, = S; analogous results for 5i 7^ Sj, will be stated without a detailed 

154 analysis. 

155 2.2 Heavy tail Lambert W Random Variables 

156 Tukey's h transformation (1) is strongly related to the approach taken by Goerg (2011) to introduce skewness 

157 in continuous RVs X ~ Fx{x). It even holds that if Z ^ Tukey's h, then has a skewed Lambert W xxf 

158 distribution with skew parameter j ~ h. 

159 Adapting the skew Lambert W x Fx input/output idea" (see Fig. 1), Tukey's h RVs can be generalized 

160 to heavy-tailed Lambert W x Fx RVs. 

Definition 2.1 (Non-central, Non-scaled Heavy-tailed Lambert W x Fx Random Variable). Let U be a 

continuous RV with cdf Fjj [u \ (3) and pdf fjj {u\ (3), where f3 is a possible parameter vector of Fx{u). 
Then, 

Z := U exp (^^U^^ , 6 eR, (4) 

161 is a non-central, non-scaled heavy tail Lambert W x Fx RV with parameter vector 9 — {(3,6), where 5 is 

162 the tail parameter. 

163 Tukey's h distribution results for U being a standard Gaussian M{0,1). For location-scale input, e.g. a 

164 general Gaussian RV it is necessary to extend Definition 2.1. 

Definition 2.2 (Location-scale Heavy-tailed Lambert W x Fx Random Variable). For a continuous location- 
scale family input X ~ Fx{x \ f3) define a location-scale heavy-tailed Lambert W x Fx RV 

r := |i7exp(^^[/2^|(7, + Ai,, SeR, (5) 

165 with parameter vector 6 = {f3,S), where U = {X — fj.x)/crx is the zero-mean, unit variance version of X . 

166 The input is not necessarily Gaussian but can be any other location-scale continuous RV, e.g. uniform 

167 X ^ U{a,b). For scale family input - such as X ^ r(a, b) - the following definition will be used. 

Definition 2.3 (Scaled Heavy-tailed Lambert W x Fx Random Variable). Let X be a continuous RV from 
a scale-family distribution Fx{x/s \ (3). Let ax be the standard deviation of X and U — Xjox. Then, 

y := Xexp , SeR, (6) 

168 is a scaled heavy-tailed Lambert W x Fx RV with parameter vector 6 = {(3, S). 

169 Let T :— {^,x{l3),ax{fi),S) be the parameter vector^ that defines transformation (5). For simplicity and 

170 readability let Hs{u) :— uexp (fu^). 

^Most concepts, terminology, and methods of the skew Lambert W case relate one-to-one to the heavy tail Lambert W RVs 
presented here. Thus for the sake of concision I refer to Goerg (2011) for a detailed account and background information of the 
Lambert W framework. 

''For non-central, non-scale input set r = (0, 1, S) and for scale-family input let t = (0, ctx, (5). 
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171 The shape parameter S (= Tukey's h) governs the tail behaviour of the transformed RV Y: for ^ > values 

172 further away from /i^ are increasingly emphasized, leading to a heavy-tailed version of Fx{x); for S — the 

173 output Y = X input; and for S < values far away from the mean are mapped back again to values closer 

174 to Thus heavy tail Lambert W x Fx RVs generalize X ^ Fx[x) to a new class of heavy-tailed versions 

175 Y ^ Gyiy) of itself with a reduction to the original for (5 = 0. 

176 The Lambert W formulation of heavy tail Modelling is more general than Tukey's h distribution in the 

177 sense that X can have any distribution Fx{x), not necessarily Gaussian (Fig. 4). 

17S Remark 2.4 (Only non-negative 5). Although 5 <Q leads to interesting properties ofYI will not discuss it 

179 any further: S < results in a non-bijective transformation and consequently to parameter- dependent support 

ISO and non-unique input. Thus for the rest of this study I will tacitly assume 6 > 0, unless stated otherwise. 

181 In this case, if X has support on (— oo, oo), then for all 6 > also the location-scale Y G (— oo, oo). For a 

182 scale family X e [0, oo) also the scale Lambert W x Fx RVY g [0,oo). 



2.3 Inverse transformation: "Gaussianise" heavy-tailed data 

For S > transformation (5) is bijective and its inverse can be obtained via Lambert's W function, which 
is defined as the inverse of 2 = wcxp(u), i.e. that function that satisfies W{z) exp{W{z)) = z. Lambert's W 
has been studied extensively in mathematics, physics, and other areas of science (Corless, Gonnet, Hare, and 
Jeffrey, 1993; Rosenlicht, 1969; Valluri, Jeffrey, and Corless, 2000), and is implemented in several standard 
software packages via the GNU Scientific Library (gsl) (Galassi, Davics, Thcilcr, Gough, Jungman, Aiken, 
Booth, and Rossi, 2011). Only very recently it received attention in the statistics literature (Goerg, 2011; 
Jodra, 2009; Pakes, 2011; Rathie and Silva, 2011). It has several useful properties (see Appendix A and for 
more details Corless et al. (1993)), in particular W{z) is bijective for z > 0. 

Lemma 2.5. The inverse transformation of (-5) is 

Wr{Y) := Ws + = Ua^ + l^x=X, (7) 

where 




Ws{z) :=sgn(z) \ M , (8) 



192 and sgn(z) is the sign of z. The function Ws{z) is bijective for all d > and all z G M. 

193 Proof. See Appendix B. □ 

194 Lemma 2.5 gives for the first time an analytic, bijective inverse of Tukey's h transformation; Hg'^{y) 

195 of (Morgenthaler and Tukey, 2000) is now analytically available as (7). Bijectivity implies that for a given 

196 dataset y and parameter vector r, the exact corresponding input x^. = Wr{y) with cdf Fx{x) can be obtained. 

197 In view of the importance and popularity of Gaussianity, we clearly want to back-transform skewed, heavy- 

198 tailed data to something Gaussian rather than yet another heavy-tailed distribution. Typically tail behaviour 

199 of RVs are compared by their fourth central standardized moment ^2{X) = E{X — fi^)'^ / <^x ~ i-*^- their kurtosis; 

200 for a Gaussian RV 72 (^) = 3. Hence it is natural to set 3 as the reference value, and for the future when 

201 we "normalize the data y" we not only subtract the mean, and divide by the standard deviation, but also 
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(a) hh = SiSr transformation (2) (b) Inverse transformation Wsi^^s,.(^) (^) Inverse transformation Ws{z) (8) as 

(9) a function of 5 and z 



Figure 3: Transformation and inverse transformation for 6i = and 6r = 1/10: identity on the left (same 
tail behaviour) and a heavy-tailed transformation in the right tail of input U. 



202 transform it to data x,- with 72 (x^.) = 3 - a "Normalization" in the true sense of the word (see Fig. 2b). 

203 This data-driven view of the Lambert W framework can also be useful for non-parametric density estimation, 

204 where multivariate data is often pre-scaled to unit- variance, so a kernel density estimator (KDE) can use the 

205 same bandwidth h in each dimension (Hwang, Lay, and Lippman, 1994; Wasserman, 2007). "Normalizing" 

206 the data the Lambert Way does not only pre-scale the data, but also improves KDEs for heavy-tailed data 

207 (see also Maiboroda and Markovich, 2004; Markovich, 2005; Takada, 2001). 

Corollary 2.6 (Inverse transformation for Tukey's hh). The inverse transformation of (2) is 

W8,,sM = < (9) 

208 Figure 3b shows Wg^.s^iz) for 61 — and dr — 1/10. The original transformation in Fig. 3a generates 

209 a right heavy tail version of the input U (x-axis) as it stretches the positive axis (y-axis), but leaves the 

210 negative axis the same. By definition Ws,^Sr{z) does the opposite and removes the heavier right tail in Z 

211 (positive y-axis) back to the original U. Fig. 3c shows how Ws{z) operates for various degrees of heavy tails 

212 and z E [0, 3]. li 6 is close to zero, then also Ws{z) « z; for larger S, the inverse maps z to smaller values, in 

213 particular if z is also large. 

Remark 2.7 (Generalized transformation). Transformation (1) can be generalized to 

Z = C/exp (^-^U^"^ , where C/^" = (U^)" ,a> 0. (10) 
The inner term guarantees that the transformation is bijective for all a > 0. The inverse transformation 



IS ^ 

WsAz) --^ sgniz) (w I^^^Y" . (11) 

For the sake of comparison with Tukey's h distribution I will here consider the a — 1 case only. For 
a — 1/2 it is closely related to the skewed Lambert W x Fx distributions. Future research can study the 
general case of heavy-tail Lambert W x Fx distributions for arbitrary a > 0. 
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217 2.4 Distribution and Density Function 

For ease of notation let 



z= u^Ws{z), X = Wr{y) ^ ua^ + n^. (12) 

218 Theorem 2.8 (Distribution and Density of Y). The cdf and pdf of a location-scale heavy tail Lambert W x 

219 Fx RV Y equal 

Gy (y I /3, <5) = Fx {Wsiz)a, + ^ix \ /3) (13) 

and 



9Y [y \f3.5) = fx[Ws[ ^—^ + I /3 



l + W[5 



(14) 



(Ws (^) + I (3) . e-'^M^r ^ ^, (15) 

220 where the second equality follows as by definition ^^J^-* = _ 

221 Clearly, Gy {y \ = Q) = Fx [y \ /3) and gY {y \ f3,S = 0) = fx (y | /3), since lima^o Ws{z) = z and 

222 lim^^o WiSz"^) = for all z G M. 

223 Proof. See Appendix B. □ 

224 For the cdf and pdf of scale family or non-central, non-scale input set /i^; = or = 0, CTj; = 1 in Theorem 

225 2.8. 

226 The explicit formula (14) allows a fast computation and theoretical analysis of the likelihood for any input 
™ /x(')i which is essential for any kind of - either frequentist or Bayesian - statistical analysis. A more detailed 

228 analysis of the functional form in (14) and its properties is given in Section 4.1 on maximum likelihood 

229 estimation. 



230 
231 
232 
233 

234 
235 



Figure 4 shows (13) and (14) for various 5 >{) with for four different input X ^ Fx{x \ (3): for (5 = = 
the input equals the output (solid black) ; for larger 5 the tails of Gy (y | 9) and qy (y | d) get heavier (dashed 
colored). 

Corollary 2.9 (Cdf and pdf of the double-tail {hh) Lambert W x Fx RV). The cdf and pdf of Z in (2) 

equal 

iGziz\l3,5e), ifz<0, 
Gz{z\l3,S,,Sr)^{ 1^' - ' (16) 

\Gz{z\(3,Sr), ifz>0, 
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(a) Lambert W X (b) Lambert W X r(s,r) (c) Lambert W X Af{l^,cr^) (d) Lambert W X U{a,b) 

with/3 = fc = l. with /3 = (s,r) = (3, 1). with /3 = (/i, cr) = (0, 1). with /3 = (a, b) = (-1, 1). 



Figure 4: The pdf (top) and cdf (bottom) of a heavy-tail (a) "non-central, non-scaled" , (b) "scale" , and (c 
and d) "location-scale" Lambert W x Fx RV Y for various degrees of heavy tails (colored, dashed lines). 



237 2.5 Quantile Function 

The quantile function Tukey's h RV Y has been very important in statistical practice, as quantile fitting 
has been the standard procedure to estimate fix, cTx, and d (or Sg and dr)- In particular, the median of Y 
equals the median of X. Thus for symmetric, location-scale family input the sample median of y is a robust 
estimate for ii^ for any 6 >0 (see also Section 5). General quantiles can be computed via (Tukey, 1977) 



238 where Ua ■= Ws{za) are the a-quantiles of the input distribution Fu{u). As quantiles of U are typically 

239 tabulated, or easily available in software packages, (18) can be computed very efficiently using and r. 



241 This simple calculation is especially useful for statistical education: teaching heavy-tailed statistics in 

242 introductory courses will soon become too difficult using Cauchy, Levy or a-stablc distributions. Yet, back- 

243 transforming data via Lambert's W, using previously learnt methods for the Gaussian case, and then trans- 

244 forming the inference back to the "heavy-tailed world" - e.g. transforming quantiles via using (18) - is 

245 straightforward. Thus the Lambert W x Fx framework can promote heavy-tailed statistics with enduring 

246 value in introductory statistics courses. 



236 and 




(17) 




(18) 



240 
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3 Tukey's h Distribution: Gaussian Input 

For Gaussian input Lambert W x Fx equals Tukey's h distribution, which has been studied extensively in 
the literature. Dutta and Babbel (2002) show that 

{0, if n is odd and n < 

"'■'l^l1T ' ifnisevenandn<l, (19) 
ifn>i, 

which in particular implies (Todd C. Headrick and Sheng, 2008) 

EZ = EZ^ = if (5 < 1 and 1/3, respectively (20) 

(13^5)^ if ^< 2' ^^'-3(T3]^if^<4- (21) 
Thus the kurtosis of a heavy tail Lambert W x Gaussian RV Y equals (see Fig. 5) 

^^^'^-'i^W^^'^'^ (22) 

For S — expressions (21) and (22) reduce to the familiar Gaussian values. Expanding (22) around this 
Gaussian point 5 = yields 

72((5) = 3 + 125 + 66(52+0(^3). (23) 



Ignoring 0{S^) and solving the quadratic equation (taking the positive root) gives a rule of thumb estimate 

v/66 72(y)- 162-6]^, (24) 
where 72 (y) is the sample kurtosis estimate from data y and [a]^ — max(a,0). 



^Taylor — gg 



Corollary 3.1 (Cdf and pdf of Tukey's h distribution). The cdf of Tukey's h distribution equals 

Gy {y I /i.,a.,<5) = $ f^^M^) , (25) 



where $(u) is the cdf of a standard Normal. The pdf equals (for 5 > 0) 

gy{y I A.., a., 5) = cxp (-^M^. (^Vl — "a (26) 



l + W [5 



Proof Take X ^ Af (^^, ct^) in Theorem 2.8. □ 
Section 4.1 studies functional properties of gviv \ in more detail. 
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(a) Variance 



(b) Kurtosis 



Figure 5: Variance and kurtosis of Tukey's h 
function of 6 and l/v, respectively. 



Lambert W x Gaussian versus student-t distribution as a 



3.1 Tukey's h versus student's t 

Student's ti, distribution with v degrees of is often used to model heavy-tailed data (Wong, Chan, and Kam, 
2009; Yan, 2005), as its tail index equals ly. Thus the nth moment of a student t RV T exists ii n < v. In 
particular, 

ET = ET^ = if 1/ < 1 or < 3, ET^ = = ^^-^ if - < (27) 



and kurtosis 



v-A 



1 



V 4 



(28) 



253 Comparing the second expression of (28) with (22) shows a natural association between \/v and 5 and 

254 also a close similarity between student's t and Tukey's h distributions in terms of their first four moments 

255 (Fig. 5). By continuity and monotonicity of variance and kurtosis as functions of \/v and 5, it is clear that 

256 the first four moments of a location-scale t distribution can always be perfectly matched by a corresponding 

257 location-scale heavy-tail Lambert W x Gaussian distributions. Thus if student's t is used to model heavy 

258 tails, and not as the true distribution of a test-statistics, it might be worthwhile to also fit heavy tail Lambert 

259 W X Gaussian distributions for an equally valuable "second opinion" on properties of the data. For example, 

260 a parallel analysis of S&P 500 log-returns in Section 6.2 leads to divergent inference regarding the existence 

261 of fourth moments. Additionally, the Lambert W approach allows to Gaussianise the data and thus might 

262 reveal hidden patterns in the data - which can be easily overseen in presence of heavy tails (Section 6.3). 



263 4 Parameter Estimation 

264 For a sample of N independent identically distributed (i.i.d.) observations y = (j/i, . . . ,yN), which presum- 

265 ably originates from transformation (5), 6 = {(3,5) has to be estimated from the data. Due to the lack of a 

266 closed form pdf of Y , this has been typically done by matching quantiles or a method of moments estimator 

267 (Field, 2004; Morgenthaler and Tukey, 2000; Todd C. Headrick and Shcng, 2008). Using the pdf (14) these 
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268 methods can now be replaced by the - usually optimal - maximum likelihood estimator (MLE). 



4.1 Maximum Likelihood Estimation 

For an i.i.d. sample y ~ 5y (y | /3, the log-likelihood function equals 



N 



£{0\y)^Y.^oggY{y^\f3,S). (29) 



The MLE is that = {(3,6) which maximizes (29), i.e. 



Oaile = [f3,6) = arg max e{(3,S\y). 

\ /MLE f3,S 

Since gviVi \ (^,S) is a function of fx{xi \ l3), the MLE depends on the specification of the input density. 
Due to the multiplicative term in (14), expression (29) can be decomposed in two additive terms 

i{f3,5\y)=£{(3\^r)+n{T\y), (30) 

where 

£(/3|xO = ^log/x ( Ws ( ^^^^] a, + \ f3] ^J2^og fx {^r\ (3) (31) 
is the log-likelihood of the back-transformed data and 

n 

n{T\y)=Y,^ogR{fi,,a,,5\y,), (32) 

i=l 

where 

ws(y-^) 

R{fix,a,,S\y,) = . — ^. (33) 



1 



270 Note that R (nx, (Jx, S \ yi) only depends on /Ua;(/3) and cTx{(3) (and 5), but not necessarily on every coordinate 

271 of (3. 

272 Decomposition (30) shows the difference between the exact MLE {j3, 5^ based on y and the approximate 

273 MLE (3 based on back-transformed data x^-: if we knew r — {fix,crx,S) beforehand, then we could back- 

274 transform y to x^ (no r since the inverse transformation is assumed to be known) and compute /9jv/le 

275 based on x^ (maximize (31) with respect to /3). In practice, however, r must also be estimated and this 

276 enters the likelihood via the additive term TZ(t \y). It can be shown that for any S M the expression 

277 logi? (px, (Jx, S \ yi) < ii 6 > 0, with equality if and only ii S = 0. Thus 72. (r | y) can be interpreted as a 

278 penalty for transforming the data. Maximizing (30) faces a trade-off between transforming the data to follow 
™ fx{x I (3) (and thus increasing ^ (/3 | xy)) versus the penalty of doing a more extreme transformation (and 

280 thus decreasing TZ{t \ y)). 

281 Figure 6a shows the contour plot of R (px = 0,ax — \ y) as a, function of S and y = z. The penalty for 

282 transforming the data increases (in absolute value) either if 5 gets larger (for any fixed y) or for larger y (for 
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(a) Penalty term R{fix,'^x,S \ yi) (33) in the full likelihood (b) (left) random sample z of Tukey's h with U ~ A^(0, 1) 
£{fi,S I y) as a function of S and y. Input set to /ix = and and h = 5 = 1/3; (right) additive decomposition of the 
(Tj; = 1. log-likelihood £{6 \ y) (solid, green) in input log-likelihood 

(dashed blue) and penalty term (red dashed). Vertical lines 
show true 5 = 1/3 and &mle ~ 0.37 

Figure 6: The log-likelihood i{9 \y) decomposition for Lambert W x Fx distributions. 



283 any fixed 5). In both cases, increasing 5 will make the transformed data Ws{z) get closer to = /i^;, which in 

284 turn increases its input likelihood. For (5 = there is no (multiplicative) penalty since input equals output; 

285 for 2/ = there is no penalty since ^^(0) = for all 5. 

286 Figure 6b shows (left) a random sample {N — 1000) z ~ Tukey's h with 5 = 1/3 and the decomposition 

287 of the log-likelihood as in (30). Since (3 = (0, 1) is known, likelihood functions and penalty terms are only 

288 functions of 5 (and the data z). The monotonicity of the penalty term (decreasing) and the input likelihood 

289 (increasing) as a function of 5 is not a property of this particular sample, but holds true in general (see 

290 Theorem 4.1 below). By this monotonicity in each component it follows that their sum (green line) has a 

291 unique maximum; here Smle = 0.37 (red dotted vertical line). 

292 

293 The maximization of (30) can be carried out numerically. Here I show existence and uniqueness of Smle 

294 assuming that fix and ax are known. Theoretical results for 9mle remain a task for future work. Given the 

295 "nice" form of gviy) - continuous, twice differentiable, "'support does not depend on the parameter, etc. - 

296 the MLE for 9 = {l3,5) should have the standard optimality properties (Lehmann and CascUa, 1998). 

■'Assuming that /x(') is twice differentiable. 



15 



4.1.1 Properties of the MLE for the heavy tail parameter 

Without loss of generality (and for better readability) let fi^ — and ^ 1. In this case the likelihood 
function simplifies to 

l{S\z)^--Y, [Ws{z^)f + J2 log - log (l + ^ [Wsiz.)]') (34) 



1=1 1=1 

N N 



^ + ^ ^ [Ws{z,)f - ^log (l + <5 [Ws{z,)f) . (35) 



2 

1=1 



298 Theorem 4.1 (Unique MLE for S). Let Z have a Lambert W x Gaussian distribution, where = ^ o,nd 

299 (Tx = 1 are assumed to be known and fixed. Also consider only the case 5 € [0, cxd)/'' 



a) If 

Z^i=l 
Z^i=l 

300 i/ien Smle — 0. 

301 // (3G) does not /lo/rf, t/ien 

b) Smle > exists and is a positive solution to 

N 



< 3, (36) 



302 There is only one such 5 satisfying (37), i.e. Smle is unique. 

303 Proof sketch. See Appendix B for a detailed proof. 

304 a) If condition (36) holds, then D{S \ z) := g|^((5 | z) is negative at 5 = and stays negative for all S > 0. 

305 Hence the maximizer occurs at 5 = 0. 

306 b) If (36) does not hold, then D{S = | z) > 0, decreases in S and crosses the zero line (one candidate for 

Smle occurs here). 

308 c) As 5 gets larger, D{S \ z) reaches a minimum (negative value) and starts increasing again. However, for 

309 (5 — > oo the derivative approaches zero from below and never equals zero again; thus Smle is unique. 

310 EH 

311 Condition (36) says that the MLE only yields positive estimates if the data is heavy-tailed enough. Points 

312 b) and c) guarantee that there is no ambiguity in the heavy tail estimate. This is an advantage over student's 

313 t distribution, for example, which has numerical problems and local maxima for unknown (and small) (o 



^Whilc for some samples z the MLE also exists allowing all (5 £ R, it can not be guaranteed for all z. The reason lies again 
in special properties of the Lambert W function. If 5 < (and z 0), then Ws{z) is either not unique in R (principal and 
non-principal branch) or may not even have a real-valued solution. 
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Algorithm 1 Find optimal S: function delta_GMM(-) in the LambertW package. 
Input: standardized data vector z; theoretical kurtosis j-ziX) 
Output: Sqmm as in (38) 

1: Sgmai = Jirgmina ||72(u5) - 72(-'^)||, where u^ = Ws(z) subject to S > 
2: return Sgmai 



314 large S) (see also Breusch, Robertson, and Welsh, 1997; Fernandez and Steel, 1999; Liu and Rubin, 1995). 

315 The global maximum property of Smle holds for any 6 > 0. 

316 For future theoretical analysis regarding the MLE it is worthwhile to point out that the log-likelihood and 

317 its gradient depend on 6 and z only via Ws{z). Given the heavy tails in z (for (5 > 0) we might expect that 
31S larger 5 lead to difficulties in the evaluation of integrals (e.g. expected log- likelihood. Fisher information). 

319 However, Ws{Z) ~ A^(0, 1) for the true 6 > Q, and close to a standard Gaussian if 5mle ~ S. Thus the 

320 performance of the MLE should not get worse for large S as long as the initial estimate is close enough to 

321 the truth. Simulation studies in the next section support this conjecture, even for the joint MLE Omle- 

322 A disadvantage of the MLE is the mandatory a-priori specification of the input distribution. Especially for 

323 heavy-tailed data the eye is a bad judgement to choose a particular parametric form of the input distribution. 

324 It would be useful to estimate r directly from the data, without the intermediate step of estimating 6 first 

325 (and thus no distributional assumption for the input is necessary). 

326 4.2 Iterative Generalized Method of Moments (IGMM) 

327 Here I present an iterative method to obtain r, which builds on the input/output aspect and relies upon 

328 theoretical properties of the input X. For example, if a random variable should be exponentially distributed 

329 (e.g. waiting times), but the observed data shows heavier tails then it is natural to estimate = A^^ and 

330 S such that the back-transformed data has skewness 2, as this is a particular property of exponential RVs 

331 - independent of the rate parameter A; to remove heavy tails in y we should choose t such that the back- 

332 transformed data x,- has sample kurtosis 3; or for uniform input, we can try to find a t such that has a 

333 flat density estimate. 

334 

335 Here I describe the estimator to remove heavy-tails in location-scale data, in the sense that the kurtosis of 

336 the input equals 3. It can be easily adapted to match other properties of the input as outlined above. 

For a moment assume that fix — M^^^' a-^d ax — cr^^ are known and fixed; only S has to be estimated. 
A natural choice for S is the one that results in back transformed data x^- (t — {fj,^x\ cr^\ S)) with sample 
kurtosis 72(xr) equal to the theoretical kurtosis 72(A). Formally, 

Sgmm = arg mm 1 172 (A) - 72 (x^) 1 1 , (38) 

d 

337 where ||-|| is a proper norm in M. 

338 While the concept of this estimator is identical to its skewed version (Goerg, 2011), it has one important 

339 advantage: the inverse transformation is bijective. Thus here we do not have to consider "lost" data points 

340 when applying the inverse transformation. 
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Algorithm 2 Iterative Generalized Method of Moments: function IGMM(-) in the LambertW package. 



Input: data vector y; tolerance level tol; theoretical kurtosis '^2{X) 
Output: IGMM parameter estimate tiqmm = (J'^xt^x,^) 

1: Set r(-i) = (0,0,0) 

2: Starting values: r^^^ = (/il*'\ S^^^), where /ii""* = y and = ay ■ ^ ^^-^ 2Mo))3/2 ^ sample 
median and standard deviation of y divided by the standard deviation factor (see also (21)), respectively. 
= ^ (v/6672(y) - 162 - 6^ ^ see (24) for details. 
3: fc = 

4: while I IrC") - r^"^^) 1 1 > tol do 

5: zW = (y-Mi'^)M'=) 

6: Pass z^*^) to Algorithm 1 — > d^''+^'> 

7: back-transform z^^^) to ut'^+i' = Wgik+i) (z^^^^); compute x^^+i) = u'^^+i) di''^ + ^''^ 

8: Update parameters: fj^i^^^^ = Xfe+i and ai'^^^^ = (y^^^^-^ 

9: r('=+i) = (Mi'+'\ai'=+^\<5('=+i)) 

10: fc fc + 1 

11: return tjgmm = t'-''^ 



341 Discussion of Algorithm 1: The kurtosis of F as a function of S is continuous and monotonically in- 

342 creasing (see (22)). Also u = Ws{z) has a smaller slope than the identity u — z, and the slope is decreasing 

343 as 6 is increasing. Thus if the kurtosis of the original data is larger than the objective kurtosis of the back- 

344 transformed data, 72(y) > l2iX), then there always exists a (5^*-' that achieves 72(Xt-) = 72(A') for the 

345 back-transformed data. 

346 By re-parametrization to S — log 5 the bounded optimization problem can be turned into an unbounded 

347 one, and solved by standard optimization algorithms. 

348 

349 In practice, Hx and are rarely known but also have to be estimated from the data. As y is shifted and 

350 scaled ahead of the back-transformation Ws{-), the initial choice of and ax affects the optimal choice of 

351 6. Therefore the optimal triple r = (jlxj^x,^) rnust be obtained iteratively. 

352 Discussion of Algorithm 2: Algorithm 2 first computes z^'''' = (y — ^iif')/a^^ using ^^x^ and a^^ from 

353 the previous step. This normalized output can then be passed to Algorithm 1 to obtain an updated (5^'^+^) :— 

354 Sgmm- Using this new J^*-'"*'^) one can back-transform z^*^^ to u^'^^-^^ = Wg(k+i) (z^'^)), and consequently obtain 

355 a better approximation to the "true" latent x by x*^'^"'""'^-' — u'^'^^^-' cri'^'' -|- However, J^'^+i) - and therefore 

356 x'^'"'+^^ - has been obtained using fi^^ and ai''^ which are not necessarily the most accurate estimates in light 

357 of the updated approximation x^^(fc) ^(t) ^(^+1)^- Thus Algorithm 2 computes new estimates ii'x^^^ and cri*'"*^^' 

358 by the sample mean and standard deviation of ^^^(fo) ^{k) g{k+i)y and starts another iteration by passing the 

359 updated normalized output z''^'^^^' = ^ '(k+i) — to Algorithm 1 to obtain a new 

360 It returns the optimal tigmm once the estimated parameter triple does not change anymore from one iter- 

361 ation to the next, i.e. if | \t^^'> — t^^+'^) | | < tol. 

362 

363 An advantage of the IGMM estimator is that it requires less specific knowledge about the input, and can 
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Algorithm 3 Random sample generation: function rLambertW(-) in LambertW package. 

Input: number of observations n; parameter vector 0; specification of the input distribution Fx{x) 
Output: random sample (yi, . . . , y„) of a Lambert W x Fx RV. 

1: Simulate n samples x = (xi, . . . , Xn) ~ Fx{x). 

2: Compute /ij. = fJ-xifi) and = (Jx{fi) 

3: Compute normalized u = (x — [ix)l(yx- 

4: z = uexp (|u^) 

5: return y = zCTj; + [ix 



364 be just seen as a data transformation rather than an accurate, "true" statistical model for the data. Usually, 

365 it is also faster than the MLE. Once tigmm has been obtained, the latent input data x can be recovered 

366 via the inverse transformation (7). This new data x^^^^j^j can then be used to check if X has characteristics 

367 of a known parametric distribution Fx {x \ j3), and thus is an easy test if y follows a particular heavy-tail 

368 Lambert W x Fx distribution. It must be noted that tests are too optimistic as xy will have "nicer" 

369 properties regarding Fx than the true x would have. However, estimating the transformation requires only 

370 three parameters and for a large enough sample, losing three degrees of freedom of the test-statistics should 

371 not matter in practice. 

Remark 4.2 (IGMM for double-tail Lambert W x Fx)- For a double-tail fit the one- dimensional optimiza- 
tion in Algorithm 1 has to he replaced with a two-dimensional optimization 

\^^'^^) r..,., " argmin/i(72(X) -72(x(^.,^.,5,,5^))) . (39) 

372 Algorithm 2 remains unchanged. 

373 5 Simulations 

374 Since this class of distributions is based on transformations of RVs rather than on a manipulation of the 

375 pdf or cdf, generating random samples is straightforward (Algorithm 3). This section explores finite sample 

376 properties of estimators for 9 ~ {^x,<7x,S) and (/ly, ay) under Gaussian input X ^ Af{^x, f^)- In particular, 

377 it compares Gaussian MLE (estimation of Hy and ay only), IGMM and Lambert W x Gaussian MLE, and 

378 - for a heavy tail competitor - the median. For IGMM, optimization over the heavy-tail parameter 6 was 

379 restricted to [0, 10] as larger values resulted in a numerical overflow in lambertW_0 of the gsl package. 

380 All results shown below are based on n = 1, 000 replications. 

381 5.1 Estimating 6 only 

382 Here I show finite sample properties of 6mle assuming standard Gaussian input U ^ A/'(0, 1), i.e. fj.x = 

383 and ax — I are fixed and only i5 is estimated given the sample (zi, . . . , zjy) ~ z ~ u ■ e^"^. Theorem 4.1 

384 shows that the MLE of 5 is unique: either a boundary extremum at (5 = or the globally optimal solution 

385 to (37). The results reported in Table 1 were obtained via numerical optimization restricted to 6 > 0.^ 

^This restricted optimization problem was again transformed into an unrestricted one, using 5 = log 5. Then the function 
nlm in R was used to obtain the global optimum 5* , which then was transformed back to 5* = . 
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N 


6^ 





S ^ 


1/10 


(5 = 


1/3 


S = 


1/2 


1 n 




0.191 




— n ni 7 


0.394 




— n 049 

U.U4:Z/ 


0.915 






1 ^f\7 
±. i\j t 


50 


0.013 


0.187 




-0.010 


0.492 




-0.018 


0.931 




-0.016 


1.156 


100 


0.010 


0.200 




-0.010 


0.513 




-0.009 


0.914 




-0.006 


1.225 


400 


0.005 


0.186 




-0.003 


0.528 




0.000 


0.927 




-0.004 


1.211 


1000 


0.003 


0.197 




0.000 


0.532 




-0.001 


0.928 




-0.001 


1.203 


2000 


0.003 


0.217 




-0.001 


0.523 




0.000 


0.935 




-0.001 


1.130 


N 


s = 


1 


S 


= 2 


s = 


= 5 






1 n 


— U.U04 


1.987 






3.384 




— u.uou 


7.601 








50 


-0.017 


1.948 




-0.009 


3.529 




0.014 


7.942 








100 


-0.014 


2.024 




-0.001 


3.294 




0.011 


7.798 








400 


0.001 


1.919 




-0.002 


3.433 




0.001 


7.855 








1000 


0.001 


1.955 




0.001 


3.553 




-0.001 


7.409 








2000 


0.001 


1.896 




0.000 


3.508 




-0.001 


7.578 









Table 1: Finite sample properties of 5mle- For each iV, 6 was estimated n — 1,000 times from a random 
sample z ~ Tukey's h. The left column for each 5 shows the average bias, Smle ^ ^ ; each right column 
shows the root mean square error (RMSE) times ^fN . 



386 I consider various sample sizes N = 50, 100, 400, 1000, and 2000 as well as a wide range of (5 G {0, 1/10, 1/3, 1/2, 1, 2, 5}. 

387 Table 1 shows that the MLE gives unbiased results for every 5 and settles down (about N = 100) to an 

388 asymptotic variance, which is increasing with 5. Assuming /i^ and to be known is unrealistic and thus 

389 these finite sample properties are only an indication of the behaviour of the full MLE, Omle- Nevertheless the 

390 results are very remarkable for the extremely heavy-tailed data {5 > 1), for which typically statistical methods 

391 break down. One explanation in this behaviour lies in the particular form of the likelihood function (34) and 

392 its gradient (37) (Theorem 4.1). Although both depend on z, they only do so via Ws (z) = u ^ A/'(0, 1). 

393 Hence as long as (5 is in a sufficiently small neighborhood of the true 5 (34) and (37) are functions of almost 

394 Gaussian RVs and thus standard asymptotic results still apply. 

395 5.2 Estimating all parameters jointly 

396 Here we consider the more realistic scenario where also and ax are unknown. Similarly to the previous 

397 section, we consider various sample sizes [N = 50, 100, 250, and 1000) and different degrees of heavy tails, 

398 5 e {0,1/10,1/3,1,1.5}, each one representing a particularly interesting situation: i) Gaussian data (does 

399 additional - superfluous - estimation of 5 affect other estimates?), ii) slightly heavy-tailed data, all moments 

400 < 10 exist, iii) fourth moments do not exist anymore, iv) non-finite variance, v) non-existing mean, vi) 

401 extremely heavy-tailed data (does the MLE provide useful estimates at all?). 

402 Remark 5.1 (Computational problems with Lambert's W function). For the joint estimation, numerical 

403 overflow problems became much more frequent in the estimation if the true 5 was set to larger values than 2. 

404 Hence the largest 5 considered here is 6 — 1.5. While this issue has been reported to the authors of the gsl 

405 package, it still remains unresolved. 

406 Again optimization is restricted to S > for Lambert W MLE and LGMM. Due to numerical problems in 

407 IGMM, it is also restricted to a search in 6 < 10. 
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Lambert W MLE 



N 



50 
100 
250 
1000 



0.00 
0.00 
0.00 
0.00 



0.00 
0.00 
0.00 
0.00 



0.98 
0.99 
1.00 
1.00 



0.00 0.D7 0.02 

0.00 0.98 0.01 

0.00 0.99 0.01 

0.00 0.99 0.00 



0.99 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 



0.96 
0.97 
0.98 
0.99 



0.02 
0.01 
0.01 
0.00 



0.98 
0.99 
1.00 
1.00 



60 
100 
250 
1000 



0.50 
0.50 
0.51 
0.50 



0.50 
0.51 
0.51 
0.49 



0.57 
0.56 
0.52 
0.52 



0.51 O.eO 0.66 

0.51 0.62 0.62 

0.51 0.59 0.59 

0.49 0.62 0.56 



0.54 
0.53 
0.50 
0.52 



0.51 
0.52 
0.51 
0.49 



0.65 
0.65 
0.62 
0.63 



0.66 
0.62 
0.59 
0.56 



0.56 
0.56 
0.52 
0.52 



50 
100 
250 
1000 



1.24 
1.25 
1.26 
1.26 



1.01 
1.02 
0.98 
0.98 



0.72 
0.70 
0.71 
0.73 



1.01 0.76 0.21 

1.02 0.76 0.23 
0.98 0.77 0.22 
0.98 0.79 0.22 



0.73 
0.70 
0.71 
0.73 



1.02 
1.03 
0.98 
0.98 



0.78 
0.78 
0.79 
0.79 



0.26 
0.26 
0.24 
0.22 



0.72 
0.70 
0.71 
0.73 



(a) Truly Gaussian data: 5 = 



S = 1/10 
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(b) Slightly heavy-tailed data: 5 = 1/10 
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(c) No fourth moments: 5 = 1/3 



Lambert W MLE 



100 
250 
1000 



0.00 
0.00 
0.00 
0.00 



-0.10 
0.74 
11.03 
3.84 



24.6 
72.4 
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348.1 
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0.00 
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(d) Non-existing mean: S = 1 
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-0.01 
0.00 
0.00 
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1.00 
0.53 
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0.54 
0.51 
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0.52 
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1.78 2.84 
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1.80 2.85 



NA 
NA 
NA 
NA 



0.01 
0.00 
0.00 
0.00 
0.01 
0.00 
0.00 
0.00 



0.01 
0.00 
0.00 
0.00 



(e) Extreme heavy tails: S = 1.5 

Table 2: Based on n — 1,000 replications each. In each sub-table: (first rows) average, (middle rows) 
proportion of estimates below true value, (bottom rows) empirical standard deviation times -\/iV- 



For IGMM the tolerance level was set to tol — 1.22 ■ 10 ^ and the Euclidean norm was used. Table 2 
summarizes the simulation. Each sub-table is organized as follows: the columns correspond to the parameter 
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410 estimates of each method; the three main rows are the average over n = 1,000 rephcations (top), the 

411 proportion of values below the true parameter (middle), and the empirical standard deviation of the parameter 

412 estimate around its empirical average times ^/N - not around the truth (bottom). Note that the classic 

413 Gaussian MLE estimates ay directly, while IGMM and the Lambert W x Gaussian MLE give estimates for 5 

414 and ax, which implicitly give an estimate of ay through the functional relation ay{5, a^) — a^- , ^ = (see 

V (1 — 2(5)3/2 

415 (21)). For a fair comparison between Gaussian MLE and Lambert W estimators each sub-table also includes 

416 a column for this implied Lambert W estimate a,. — a^ ■ , ^ _ = . Some entries in the a,, column contain 

^ ^ V(i-2?)3/2 y 

417 "oo", even for (5 which still imply finite variance. This can occur if at least one of the 1,000 estimates for 8 

418 is greater or equal to 1/2. In this case the implied estimate ay — oo and thus the average over all ri = 1, 000 

419 estimates is also "oo". 

420 The location parameter is the same for the input as for the output for any d < 1, iix = fJ-y, thus they can 

421 be directly compared. For 6 > 1, the mean does not exist; the values reported in each sub-table for these 6 

422 interpret fiy as the median. 

423 Gaussian data: ^ = 0. This parameter choice investigates if imposing the Lambert W framework, even 

424 though its use is superfluous, causes a quality loss in the estimation of the mean fj,y = ii^ or standard 

425 deviation ay — ax- Furthermore, critical values can be obtained for the null hypothesis Hq : S ~ Q (Gaussian 

426 tails). Table 2a shows that under Gaussianity all estimators are unbiased and quickly tend to a large-sample 

427 variance. The last four rows show that the sample median can not outperform the sample mean or the 

428 Lambert W x Gaussian estimators to estimate the location parameter fj,x, as it has a much larger standard 

429 deviation. Additional estimation of 6 does not affect the efficiency of fix compared to estimating /i only (both 

430 for IGMM and Lambert W x Gaussian MLE) . Estimating ay directly by Gaussian MLE does not give better 

431 results than estimating it via the Lambert W x Gaussian MLE: both are unbiased and have similar standard 

432 deviation. 

433 Slightly heavy-tailed: S = 1/10. Here the RV Y has slight excess kurtosis (3 -I- 2.5082) and ay{S,ax = 

434 1) — 1.182. The Lambert W estimators provide unbiased r, and already for this small degree of heavy tails 

435 have smaller empirical standard deviation for the location parameter than the Gaussian MLE or the median. 

436 Also using Lambert W estimators does not give worse estimates for ay. 

437 No fourth moment: 5 = 1/3. The true standard deviation of the output equals ay{S,ax = 1) = 2.2795. 

438 For this degree of heavy tails fourth moments do not exist anymore, which reflects in an increasing empirical 

439 standard deviation of ay as N grows. In contrast, estimates for ax are not drifting off. In presence of these 

440 large heavy tails the median outperforms the Gaussian MLE and IGMM as it is much less variable than the 

441 latter two. However, it is not as good as the Lambert W x Gaussian MLE for fix- 

442 Non-existing mean: 6 = 1. Here not only the standard deviation but also the mean is non-finite. Thus 

443 both sample moments diverge, and their empirical standard deviation is also growing very quickly with ^/N. 

444 The median still provides a very good estimate for the location, but is again inferior to both Lambert W 

445 estimators, which are unbiased and seem to converge to an asymptotic variance at rate 



22 



446 Extreme heavy tails: S = 1.5 As in Section 5.1, IGMM and Lambert W MLE still provide unbiased 

447 estimates, even though the data is extremely heavy-tailed. The Lambert W MLE is not only unbiased but 
44B also has the smallest empirical standard deviation amongst all alternatives. In particular, the Lambert W 

449 MLE for the location has an approximately 20% lower standard deviation than the median. 

450 The last column shows that for some N about 1% of the n — 1, 000 simulations generated invalid likelihood 

451 values (NA and oo) since the search for the optimal S lead into regions which lead to a numerical overflow in 

452 the evaluation of Ws{z). For an informative summary table, these few cases were omitted and new simulations 

453 added until a full n = 1,000 finite estimates were found. Since this only happened in 1% of the cases and 

454 also such heavy-tailed data is rarely encountered in practice, this numerical issue is not a real limitation for 

455 statistical practice. 

456 5.3 Discussion of the simulation study 

457 This simulation study confirms well-known facts about the sample mean, standard deviation, and median 

458 and compares them to finite sample properties of the two Lambert W estimators. The median is known to 

459 be a robust estimate of location, which shows here as its quality does not depend on the thickness of the 

460 tails. 

461 IGMM is unbiased estimator for r independent of the value of 6. As expected the Lambert W MLE for 

462 9 has the best properties: it is unbiased for all S, and has the same empirical standard deviation as the 

463 Gaussian MLE for small 6, and lower empirical standard deviation than the median for large S. For (5 = it 

464 performs as well as the classic sample mean and standard deviation. Hence additional estimation of 6 does 

465 not affect the quality of the estimates for location and scale if S is small, but greatly improves the inference 

466 for large S. 

467 Thus the only advantage of estimating fiy and ay by the sample mean and standard deviation of y is its 

468 speed; otherwise the Lambert W MLE is at least as good as Gaussian MLE and clearly outperforms it for 

469 heavy-tailed data. 

470 6 Applications 

471 Tukey's h distribution has already proven useful to model heavy-tailed data, but estimation has been limited 

472 to quantile fitting or methods of moments estimators (Field, 2004; Fischer, 2010; Todd C. Headrick and Sheng, 

473 2008). Theorem 2.8 puts us in the position to compute the likelihood of the data in terms of 9 = {f3,6e,Sr) 

474 and estimate 9 by ML. 

475 This section shows the usefulness of the presented methodology on simulated as well as real world data: 

476 Section 6.1 demonstrates the Gaussanizing' capabilities on a Cauchy random sample from the Introduction, 

477 Section 6.2 shows that heavy tail Lambert W x Gaussian distributions provide an excellent fit to daily S&P 

478 500 log-return series, and finally Section 6.3 shows that removing heavy tails reveals hidden patterns in 

479 power-law type data. 

^Function Gaussianise in the LambertW package. 
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480 6.1 Estimating the location of a Cauchy using the sample mean 

481 It is well-known that the sample mean y is a poor estimate of the location parameter of a Cauchy distribution, 

482 since the sampling distribution of y is again a Cauchy; in particular, its variance does not go to for n — > oo. 

483 Heavy-tailed Lambert W x Gaussian distributions have similar properties to a Cauchy for 6^1. The mean 

484 of X equals the location of Y, due to symmetry around fj,x (for all S > 0) and c, respectively. Thus we can 

485 estimate r from the Cauchy sample y and transform it to xp, estimate fi^ from x,-, and thus automatically 

486 get an estimate of the location c from the Cauchy distribution we started with. 

487 The data y ^ C(0, 1) in Fig. 2a shows heavy tails with two very extreme (positive) samples. A maximum 

488 likelihood (ML) fit using a Cauchy distribution* gives c — 0.028(0.055) and scale estimate s"= 0.864(0.053), 

489 where standard errors are given in parenthesis. A Lambert W x Gaussian MLE gives 'jl^ = 0.030(0.0547), 

490 = 1.054(0.0717), and S = 0.861(0.0819). Thus both fits correctly fail to reject = c = 0. Table 3a 

491 shows summary statistics on both samples. Since the Cauchy distribution does not have a well-defined mean, 

492 y = 2.304(2.101) is not a useful estimate. However, the sample average xy^j^^^ = 0.033(0.0472) correctly 

493 fails to reject a zero location for y. The Gaussianised version xy^^^^ also features additional characteristics 

494 of a Gaussian sample (symmetric, no excess kurtosis), and even the null hypothesis of Normality cannot be 

495 rejected (p- value > 0.5). 

496 

497 Figure 2c shows the running sample average for the original sample and the Gaussianised version. For a fair 

498 comparison I re-fit the model cumulatively for each n = 5, . . . , 500, and then Gaussianise (j/i, . . . , z/n) using 

499 the estimate t'^m\e- Even for small sample sizes the Gaussianising transformation works extremely well: the 

500 highly influential point around n 50 greatly affects y, but has no relevant effect on x_(„) . Overall, the 

'''mle 

501 sample average of the Gaussianised data has all the nice properties of the sample mean, and even for very 

502 small sample sizes it is already clear that the location of the underlying Cauchy distribution is approximately 

503 zero. 

504 

505 Although being a toy example, it shows that removing (strong) heavy tails from data works and provides 

506 new "nice" data which can then be used for more refined models. 

507 6.2 Modelling heavy tails in financial return series: S&P 500 case study 

508 A lot of financial data displays negative skewness and excess kurtosis. Since financial data is in general not 

509 i.i.d. it is often modelled with a (skew) student-t distribution underlying a (generalized) auto-regressive 

510 conditional heteroskedastic (GARCH) (Bollcrslcv, 1986; Engle, 1982) or a stochastic volatility (SV) model 

511 (Deo, Hurvich, and Lu, 2006; Melino and Turnbull, 1990). Using the Lambert W approach we can build 

512 upon the knowledge and implications of Gaussianity (and avoid deriving properties of a GARCH or SV model 

513 with heavy-tailed, skewed innovations), and simply "Gaussianise" the data y before fitting more complex 

514 models (e.g. GARCH or SV models). Time series models based on Lambert W x Gaussian white noise are 

515 far beyond the scope and focus of this work, but can be a direction of future research. For this exploratory 

516 analysis I consider the unconditional distribution only. 

517 

^Function f itdistr in R package MASS. 
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Min 
Max 
Mean 
Median 
Stdev 
Skewness 
Kurtosis 



Shapiro- Wilk 
Anderson-Darling 



y ^'^M L E 






-161.59 -3.16 
952.95 3.81 
2.30 0.03 
0.04 0.04 
46.980 1.06 
17.43 0.12 
343.34 3.21 


-7.113 -2.421 
4.989 2.229 
0.046 0.051 
0.042 0.042 
0.948 0.705 

-0.296 -0.039 
7.702 2.925 


20 20 
231300 157 
689.4 89.0 
87 87 
6520.6 27.0 
22.2 0.1 
582.1 1.9 


* 0.71 
** 0.51 


* 0.241 

* 0.181 





(a) y~C(0,l) 
(Section 6.1) 



(b) y = S&P 500 
(Section 6.2) 



(c) y = solar flares 
(Section 6.3) 



Table 3: Summary statistics for observed (heavy-tailed) y and back-transformed (Gaussianised) data xy^^^^^. 
** stands for < 10^^^ * for < 2.2 • IQ-^^. 



518 Figure 7a shows the S&P 500 log-returns with a total of iV = 2, 780 daily observations. Table 3c con- 

519 firms the heavy tails (sample kurtosis 7.70), whereas it indicates negative skewness (—0.296). As the sample 

520 skewness 71 (y) is very sensitive to outliers in the tails, we should test for symmetry by fitting a skewed 

521 distribution and testing its skewness parameter(s) for zero. In case of the double-tail Lambert W x Gaussian 

522 this means to test Hq : Si = 6r versus Hi : 5i ^ 5r- Since the likelihood can now be computed by (30), we can 

523 do this using a likelihood ratio test with one degree of freedom (3 versus 4 parameters). The log-likelihood of 

524 the double-tail Lambert W x Gaussian fit (Table 4a) equals -3606.005 = -2972.276 + (-633.7287) (input 

525 -I- penalty) and using only one 5 for both sides it equals —3606.554 ~ —2972.276 + (—633.7287) (input + 

526 penalty). Comparing twice their difference to a distribution with one degree of freedom gives a p- value 

527 of 0.29. Additionally, I fit^° a skew-t distribution (Azzalini and Capitanio, 2003), with location c, scale s, 

528 shape a, and degrees of freedom v. Also here a is not significantly different from zero (Table 4b). Thus in 

529 both cases we can not reject symmetry. 

530 

531 Assume we have to make a decision if we should trade a certificate replicating the S&P 500. Since we can 

532 either buy or sell, it is not important if the average return is positive or negative, as long as it is significantly 

533 different from zero. 

534 6.2.1 Gaussian fit to observed data 

535 If we ignore heavy tails and estimate {^iy,Oy) by Gaussian MLE, fly = can not be rejected on a a = 1% 

536 level (Table 4c) . As OLS over-estimates the variance in presence of heavy tails, this test conclusion should 

537 not be trusted. 

538 6.2.2 Heavy tail fit to observed data 

539 If we account for heavy tails, we can impose any heavy-tailed location-scale distribution, estimate its param- 

540 eters jointly, and then test the location parameter being equal to zero. The standard errors can be computed 

package MASS, datasct SP500. 
^"Function st.mle in the R package sn. 
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(a) Observed heavy tail returns y (b) Back-transformed Gaussian input xy^^^^^ 

Figure 7: Lambert W analysis of the S&P 500 log- returns (in %). In both sub- figures: (top left) data; (top 
right) autocorrelation function (ACF); (bottom left) histogram, Gaussian fit, and kernel density estimates; 
(bottom right) Normal QQ plot. 



541 by the inverse of the numerically obtained Hessian of the log-likelihood. Both the heavy tail Lambert W x 

542 Gaussian (Table 4c) and student-^ fit (Table 4d) lead to a rejection of the zero mean null (p-values, 0.0001 

543 and 0.00003, respectively). The standard errors for the location parameter are essentially the same, which 

544 supports the claim of a "true" standard error of 3.65 for the location parameter. 

545 

546 Although location and scale estimates are almost identical, the estimates describing the tails lead to very 

547 different conclusions: while for D — 3.71 only moments up to order 3 exist, in the Lambert W x Gaussian 

548 case moments up to order 5 exist (1/0.172 = 5.81). This is especially noteworthy as many theoretical results 

549 in the (financial) time series literature rely on the assumption of finite fourth moments (Mantegna and 

550 Stanley, 1998; Zadrozny, 2005); consequently empirical studies try to test if financial data actually satisfy 

551 this assumption (Cont, 2001; Huisman, Koedijk, Kool, and Palm, 2001). For this particular dataset student's 

552 t and Tukey's h distribution give different empirical answers to the same question. Since empirical analysis 

553 are often based on student's t distribution (Wong et al., 2009), it might be worthwhile to re-examine their 

554 findings in light of heavy tail Lambert W x Gaussian distributions. 

555 6.2.3 "Gaussianising" the observed data 

556 A typical statistical analysis to estimate distribution parameters would conclude here. Using Lambert's W 

557 function we can analyse the input xy^^^^^ to test if a Lambert W x Gaussian distribution is indeed appropriate. 

558 Figure 7b shows that the back-transformed data xy^^^^^ is indistinguishable from a Gaussian sample. Not 

559 even one Normality test on xfj^^^- ^ (Anderson Darling, Cramer-von-Mises, Shapiro-Francia, Shapiro- Wilk; see 

560 Thode (2002)) can reject Gaussianity: p-values are 0.181, 0.184, 0.311, and 0.241, respectively. Table 3c also 

561 shows that Lambert W "Gaussianiziation" was successful: empirical kurtosis is close to 3, and although the 

562 sample skewness is still negative, a value of —0.039 is within the typical variation for a Gaussian sample. 
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Est. 


Std. Err. 


t 


Fr(>| t 1) 




rjSt. 


Std. Err. 


t 


Pr(> t \) 


fix 0.055 


0.015 


3.660 


0.000 


c 


0.101 


0.061 


1.647 


0.100 


ax 0.705 


0.016 


44.004 


0.000 


s 


0.669 


0.017 


38.466 


0.000 


5e 0.185 


0.021 


8.968 


0.000 


a 


-0.078 


0.101 


-0.773 


0.440 


Sr 0.159 


0.019 


8.238 


0.000 


1/ 


3.729 


0.297 


12.567 


0.000 


(a) double-tail Lambert W X Gaussian 


= Tukey's hh 






(b) skew t (S&P 500) 
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Std. Err. 
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Pr(>| t 1) 


fix 0.055 


0.015 


3.653 


0.000 


c 


0.055 


0.015 


3.650 


0.000 


ax 0.705 


0.016 


43.951 


0.000 


s 


0.667 


0.017 


39.505 


0.000 


S 0.172 


0.016 


11.046 


0.000 


V 


3.716 


0.295 


12.612 


0.000 


(c) Lambert W 


X Gaussian 


= Tukey'f 


5 h (S&P 500) 




(d) student-t (S&P 500) 




Est. 


Std. Err. 


t 


Pr(>l t 1) 




Est. 


Std. Err. 


t 


Pr(>l t 1) 


fly 0.046 


0.018 


2.546 


0.011 


fJ-Xf 


0.051 


0.013 


3.805 


0.000 


ay 0.948 


0.013 


74.565 


0.000 


ax^ 


0.705 


0.009 


74.565 


0.000 


(e) 


Gaussian (S&P 500) 








(f) Gaussian (xy^^j^^) 





Table 4: Fitting distributions by MLE to S&P 500 y (a, b, c, d, e) and the latent data xy^^^ (f). 

Thus the heavy-tailed Lambert W x Gaussian (= Tukey's h) distribution 

y= (f/e^^') 0.705 + 0.055, U= ^~^^f\ U ^ M{0,1) (40) 

563 is an adequate (unconditional) model for the S&P 500 log-returns y. For the trading decision this means 

564 that the expected return is significantly different from zero, and thus a trading certificate should be bought 

565 (/I^ = 0.055 > 0). 

566 6.2.4 Gaussian MLE for Gaussianised data 

567 For 6i — 6r = S < 1, the expectation of the input equals the expectation of the output. We can therefore 

568 replace the original test of = versus fiy ^ for a non-Gaussian sample y, with the very well understood 

569 hypothesis test fi^ = Q versus fj.^ for the Gaussian sample xy^j-^^. In particular, standard errors based on 

570 - and thus t and p- values - should be closer to the "truth" (Table 4c and 4d) than a Gaussian MLE on the 

571 non-Gaussian y (Table 4e). Table 4f shows that the standard errors for /ix are indeed much closer; they are 

572 even a little bit too small compared to the heavy-tailed versions. Since the "Gaussianising" transformation 

573 was estimated, treating X5^J^JJ^^ as if it was original data is too optimistic regarding its Gaussianity (compare 

574 to the penalty term (32) in the overall likelihood (30)). 

575 

576 This example confirms that if a model and its theoretical properties are based on Gaussianity, but the 

577 observed data is heavy-tailed, then Gaussianising the data first gives more reliable inference than applying 

578 the Gaussian methods to the original, heavy-tailed data (Fig. 1). Clearly, a joint estimation of the model 

579 parameters based on Lambert W x Gaussian errors (or any other heavy-tailed distribution) would be optimal, 

580 but theoretical properties and estimation procedures may not have been derived or implemented yet, or are 

581 simply not known to those applied researchers who are non-experts in the field of skewed, heavy-tailed 
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(a) Peak X-ray intensity y. 



(b) Zoom to y < 400; horizontal line ymin = 
323 for cut-off value from power-law fit (Clauset 
ct al., 2009). Dashed horizontal lines represent ±89. 
Right: KDE for first 4,000 (red), last 2,273 (blue), 
and all (black) data points. 



(c) Bivariate KDE fit over 
y across time. For better 
visualization y < 165. 






id) Back-transformed data x^j^ , 
= (74.46,26.32,1.53) (same 
5 for both tails); horizontal line 
is the Gaussianised cut-off value 
WV^(5^min ± 89) = 114.94 and in- 
terval (110.96, 117.58). Solid hori- 
zontal line: Hx = 74.46. 



(e) Back-transformed data X;p. t = 
(86.97,26.80,0,2.37) (separate tail Modelling); 
horizontal line is the Gaussianised cut-off 
value WV(ymin) ± 89 = 121.16 and interval 
(117.74, 123.37). 



(f) Bivariate KDE fit over 
across time (no trun- 
cation in x). 



Figure 8: Peak X-ray count rates of solar flares. 



582 statistics. The Lambert Way to Gaussianise data thus provides a pragmatic method to improve statistical 

583 inference on heavy-tailed data, while still preserving the ease of usage and interpretation of Gaussian models. 

584 6.3 Removing power laws: analysis of solar flares 

585 The previous section focused on Lambert W x Fx distributions as a "true" model for the data y. Here I 

586 consider it merely as a data transformation to remove heavy tails. In the same way as scaling y to zero-mean, 

587 unit- variance data (y — y) /dy does not necessarily mean we believe the underlying process is Gaussian, we 

588 can also convert y to = Wr{y) without assuming that the data is actually Lambert W x Gaussian. While 

589 might not be as easily interpretable as the original data, it can be helpful for exploratory data analysis 

590 (EDA), as the eye is a bad judgement to detect regularities corrupted by heavy tails. Removing them can 

591 reveal hidden patterns and thus greatly improve the accuracy of statistical models for y. 

592 



28 



593 Here I study solar flare X-ray count rates (Clauset ct al., 2009; Newman, 2005). The data^^ were collected 

594 approximately four times a day from Feb. 1980 until Nov. 1989 giving T — 12, 773 observations. See Dennis, 

595 Orwig, Kennard, Labow, Schwartz, Shaver, and Tolbert (1991) for a detailed description of this dataset and 

596 its scientific background. 

597 

598 The X-ray count rates exhibits a strong right heavy tail (Fig. 8a), which makes more detailed visual 

599 inspection as well as simple exploratory analysis hard. A zoom to yi < 400 in Fig. 8b shows that a lot of 

600 values lie between 50 and 100 and this level drops off at the end of the observation cycle. The drop is not an 

601 intrinsic property of the data but due to a decreasing sensitivity of the X-ray detectors over the course of 10 

602 years (Dennis et al., 1991). 

603 For the sake of comparison with (Clauset et al., 2009; Newman, 2005) most estimates reported here are 

604 based on all T = 12, 773 observations. Figures 8b, 8d, and 8c show a separate density estimate for the first 

605 4, 000 and last 2, 273 observations, and while the estimates change, the qualitative findings do not. 

606 Clauset et al. (2009) find that a power-law (a = 1.79(0.02)) with cut-off (ymin = 323(89)) gives the best fit 

607 amongst various alternatives. However, this first EDA might not be complete: not only visually heavy tails 
60S can obscure underlying non-trivial structure, but also estimates - such as the power law fit or non-parametric 

609 density estimates (Fig. 8b and 8c) - are affected by the heavy right tail. Here I show that Gaussianising this 

610 data reveals new insights for the data-generating process, with a new interpretation for the optimality of the 

611 cut-off value. 

612 An MLE fit gives r — (jlx, a^, Sg, Sr) = (86.97, 26.80, 0, 2.37), which shows that only the right tail needs a 

613 Gaussianising transformation.^^ The second row of Fig. 8 shows EDA for the Gaussianised data. Removing 

614 the heavy right tail (estimated tail index 1/2.373 = 0.421) reveals a bimodal structure. As the inverse 

615 transformation Wr{y) is monotonic in y for any fixed r, the bimodal density is an intrinsic property of the 

616 data, not an artefact from the transformation. 

The bimodal structure of the input also gives an additional meaning to ymin = 323. The Gaussianised 
cut-off value equals WV(323) = 121.16 with the transformed standard deviation interval [117.74,123.37] 
(corresponding to 323 ± 89). Fitting a two component Gaussian mixture model to xy yields 

AAAi(67.10, 14.042)-f (1-A)AA2(113.12,14.272), with A = 0.52, (41) 

617 and corresponding optimal decision boundary between the classes of 90.48. The mean of the larger compo- 
61S nent, 113.12, lies within one standard deviation of the optimal Gaussianised cut-off 121.16: for lower cut-offs 

619 the left-tail of the larger component - or for much lower cut-offs even the first component - would counteract 

620 the power-law decay of the upper X-ray count rates. 

621 

622 As already mentioned above, this analysis is not intended to describe the true underlying distribution of 

623 the X-ray data and parameter fits are not meant to be used for inference for the physics of X-rays; it should 

624 rather show new insights that can be gained by Gaussianizing the solar flare dataset. Future research based 

625 on this additional information can lead to new physical interpretations of the statistical properties X-ray 

Dataset SolarFlares in the LambertW package. Also available at http : //tuvalu. santaf e . edu/~aaronc/powerlaws/data.htm 
and http : / /umbra. nascom .nasa. gov/smm/hxrbs .html. 

^^For comparison Fig. 8d also shows the back-transformed data x^p^ using the same 5 on each tail (ri = (74.46,26.32, 1.53)). 
However, due to the clear right heavy tail I will continue with the (5;, 5, ) transformation. 
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626 count rates, see for example Aschwandcn (2010). 

627 7 Discussion and Outlook 

62S In this work I adapt the skewed Lambert W input/output framework to introduce heavy tails in continuous 

629 RVs X ^ Fx{x) and provide closed- form expressions of the cdf and pdf. For Gaussian input this not only 

630 contributes to the existing literature on Tukey's h distribution, but also gives convincing empirical results: 

631 skewed, unimodal data with heavy tails can be transformed to behave like Gaussian data/RVs. Properties of 

632 a Gaussian model Aij\f on the back-transformed data mimic the features of the "true" skewed, heavy-tailed 

633 model Mg very closely. 

634 Since Gaussianity is the single most typical, and necessary, assumption in many areas of statistics, machine 

635 learning, and signal processing, future research can take many directions. From a theoretical perspective the 

636 properties of Lambert W x Fx distributions viewed as a generalization of already well-known distributions 

637 F can be studied - ignoring the possibility to actually back-transform the data. This area will profit from 

638 the immense literature on the Lambert W function, which has been discovered only recently by the statistics 

639 community. Empirical analysis can focus on actually transforming the data and compare the performance 

640 of the approximation versus the correct, joint analysis. The simple comparisons here showed that doing the 

641 approximate inference is comparable with the direct heavy tail Modelling, and so provides an easy tool to 

642 improve inference for heavy-tailed data in statistical practice. 

643 It is also worth pointing out the educational value of this transformation based approach, as it can be 

644 taught in introductory classes and thus already early on promote heavy-tailed statistics with enduring value. 

645 

646 I also provide the R package LemibertW, publicly available at CRAN, to facilitate the use of Lambert W x 

647 Fx distributions in practice. 
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774 A Auxiliary results 



775 A.l Properties of Ws{z) 

776 The function Ws{z) is the building block of Lambert W x Fx distributions. This section lists useful properties 

777 of Ws{z) as a function of z as well as a function of 6. 

Properties A.l. For 5 = 0, 

Wsizi) \s=o= Zi, W'{5z1) |5=o= zl and W {6zf) 1^=0= 0. (42) 
By definition it also holds 

(43) 

z 

and therefore 

log ^ = --Ws{zy = (44) 

z / I 

Lemma A. 2 (Derivative of W^iz) with respect to 2;). It holds 

^W,(z) = ^'^^^ =e-5^(^^') ^ (45) 

z{l + SWsizf) l + W{Sz^) ^^'^ 

Proof. One of the many interesting properties of the Lambert W function relates to its derivative which 
satisfies 

^^(-) = ^(lf^ = e^(-)(lW(.)) ' ^''^ 



778 Hence, 



d W{6z') ^ W{5z^) 2W{Sz') 

dz 6 ^ ' (1 + ^ (^^2)) Sz{l + W{Sz^)) ^ ' 



779 Therefore, 



As W {Sz"^) = 5v? the last line simplifies to 

1 I 5u- u ^^^^ 



51/251/2^ z{l + 5v?) z{l + 5v?)' 
780 Now use again u = Wj{z). □ 
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Lemma A. 3 (Derivative of Ws{z) with respect to S). For all z G 



Proof. By definition \Wi>{z)f = "^^^^^ Tlius 



' Sz■^(l+W(Sz■^))■ 



W {Sz 
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l + VK((5z2) 



(52 



52 



(53) 
(54) 

(55) 

(56) 

(57) 
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782 



Since both terms are non-negative for all z e M the result follows. □ 

That is Ws{zY is a decreasing function in 5 for every z S M, i.e. the more we remove heavy tails the more 
gets shrinked (non-linearly) towards — \mis^ooWs{z). In particular, [^^(z)]^ < ^2 <^ ^ ^ and 



783 Z 



Lemma A. 4 (Derivative of Ws{z) with respect to 5). It holds 



Proof. 

dWs (z) ^ ,d fw (fc2) \ 

, ,1 /m^(,5z2)\"'^' 5 M^(<5z2) 

where the last line follows by Lemma A. 3. □ 
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A. 2 Properties of the penalty term \ogR{5 \ Zi) for standard Gaussian input 

For — Q and = ^ the penalty term equals (y^ = Zi) 



1 + ,5 (z,)) 



z,[l + W{5zj)] 



and logi? (5 I z,) = log l^iM _ log [i + (^^f-jj ^gg^ 
= -^^^-log[l + W^(5zf)] (66) 
Lemma A. 5 (Derivative of \ogR{5 \ z) with respect to 6). For all 5 > and all z eR 

Proof. We have 

d\ogR{S\z) 1 dWs {z) 1 2, 2 
9J =W^,(z) 85 -l + WiSz^)"^^^'^' 

^ ^ ^ ^ W^5(z)') 47F^W^'(^^')^' (69) 



Ws{z)\ 21 + W{dz^) ' J l + W{Sz^) 



l + WiSz^) \2' 

5^2(i+VF(5z2)) 



(z)" + M^'(<5z^)z^ (70) 



787 Using W^'((5z^) = c 2fVlw|-]^2-|i and re-factorising gives (67). 



□ 



789 A. 3 Properties of the Gaussian log-likelihood evaluated at Ws{z) 

790 The next lemma shows that increasing 6 always increases the input log- likelihood i{S \ us = ^^(z)) - see also 

791 Fig. 6b. For 6 —?' oo the Gaussianised goes to 0, which clearly is the maximizer of the Gaussian likelihood 

792 if /i = 0. 

Lemma A. 6 (Derivative of the Gaussian log-likelihood at Ws{z)). For all z eR and for 6 > 

^£(a.. ^ 0, a. = 1 I Wsiz)) = 2 j-^^^ [Ws iz)t > 0. (71) 
Proof. The log of the standard Gaussian pdf evaluated at the Gaussianised data Ws{z) simplifies to 

log 1 e-^[^^ = log -±= I [Ws{z)r . (72) 



The rest follows by Lemma A. 3. □ 
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B Proofs of the results in the main text 

795 B.l Inverse transformation 

796 Proof of Lemma 2. 5. Without loss of generality assume that fJ-x = and — 1 (otherwise standardize X 

797 first). Squaring (f) and multiplying by S yields 

6Z'^ = exp (73) 
The inverse of (73) is by definition Lambert's W{z) function (Roscnlicht, 1969) 

W{z) exp W{z) = z, zeC. (74) 



798 



W{z) is bijective for z >0. Since dU > for all 6 > 0, applying W{-) to (73), dividing by 6, and taking the 



799 square root gives 



800 
801 



Since exp (f t^^) > for all ^ € M and all it follows that Z = U exp ((5/2[/^) and U must have the same 
sign, which concludes the proof. □ 

B.2 Cdf and pdf 

Proof of Theorem 2. 8. By definition, 

Gviy) = P(r < y) = P Q;7exp I a, + /i, < 2/^ (76) 

C/exp i^-U^] <z]=V{U < Ws{z)) (77) 



.2 

= Fu{U <Ws{z)). (78) 

803 Taking the derivative with respect to y gives 

^GY{y\P,6) = f^{Ws{z)ax+^i.AP)■<^.^Ws{^-^^ (79) 

= f^[Ws{z)\l3)-ax-^Ws(y-^\ (80) 
ax dz \ <Jx J 

= fu{Ws{z)\(3)-^W5{z). (81) 
az 

804 Using Lemma A. 2 yields (14). □ 



38 



B.3 MLE for 5 

Lemma B.l (Derivative of the Lambert W x Gaussian log-likelihood). We have 

N 



Di5 I z) §-^iiS I z) . g ^W'iS^f) (Iw^. (z.f - (1 + ^^^) ) (82) 



2^ l + 5W^,(z,)2 l + 5iy.(z,)M 2 l + JW^,fz,)2] 



-T- 

2 ^ 1 



^ 1 + WiSz?) ( 2 + 1 + (6zf) ) ■ ^^^^ 



806 Proof. Follows by applying Lemmas A. 5 and A. 6 to ^^((5 | z) = log R{S \ z) + ^({fJ-x = 0,(Ti, = 1 | 

807 Wsiz)). □ 

Proof of Theorem 4-1- a) The log-likelihood is increasing at (5 = if and only if (set (5 = in (84) and use 
Properties A.l) 

N N 

Y.zt>3Y,zl (85) 

i=l 2=1 

808 i.e. actually transforming the data (choosing (5 > 0) increases the overall likelihood only if the data is 

809 heavy-tailed enough. It is important to point out that the sum of squares is not squared again. Hence 

810 the condition is not equivalent for the data having empirical kurtosis larger than 3. 

811 b) If (85) does not hold, then Smle must satisfy D{d \ z) from (82) in Lemma B.l. It remains 

812 to be shown that this equation has (at least) one positive solution. 

813 i) Since \\ms^aDWs{z) — for all z G M, (84) is also true in the limit; however, we can ignore this 

814 solution as we require 5mle G 

ii) By continuity and lim5_j.oo Ws{z) = it also follows that for sufficiently large 6m, M^iSm (^i) < 1 fo^' 
all Zi G M. Hence Ws^^Zi)'^ < Wsj^^iz^)^ and therefore 

ly^ Wsjz^f ^ly. Wsjz^f 
'2jr[l + 5Ws{z.f 2^^i + SWsiz,f 

^^MfOl^/l 1 \ ,,,s>Sm, (87) 



i=l 



which shows that D{5 \ z) \s>Sm'^ 0- That is D{5 \ z) is approaching from below for 5 — ?> oo. 

iii) By continuity and since D{5 \ z) |5=o> (if (85) does not hold), it must cross the D{6 \ z) ~ 0-line 
at least once in the interval (0, 5m), proving the existence of Smle- 
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c) The log-likelihood can be decomposed in 



^ (5 I z) cx - i 5: [W,{z,f + log - log [l + W (Sz^)] . (88) 

t—i t—i 

^ V ' ^ V ■' 

e{fi^=Q.a^ = l\Ws{2,)) K(5|z) 

818 Lemmas A. 5 and A. 6 show that TZ{6 \ z) is monotonically decreasing and £{fix — 0, <^x — ^ \ Ws{z)) is 

819 monotonically increasing in 6. 

820 Furthermore, lims^oo ^ifJ-x = 0, (Ta; = 1 | Ws{z)) — 0, that is the input likelihood is monotonically 

821 increasing but bounded from above (by = logl). On the other hand lim5_5.oo 7?.((5 | z) — oo showing 

822 that the penalty term is decreasing in an unbounded manner. Thus their sum has a global maximum 

823 either at the unique mode oi £{S \ z) or at the boundary 6 = 0- see also Fig. 6b. 

824 n 
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